Text-Classification-and-Context-Mining-for-Document-Summarization

Affiliation

Centre For Indian Language Technology, Indian Institute of Technology, Bombay

Dec 2018 - July 2019

Applied ML and NLP Research under Dr. Ganesh Ramakrishnan, CS Dept., IIT Bombay

Summary

This project autonomously attempts a categorical summarization of a sparse, asymmetrical English-language corpus by performing text classification, which is achieved through sentence pair classification scenarios and use cases. Two algorithms are involved: sentence pair similarity and keyword-based context mining. The objective is to semantically summarize and classify asymmetrical English corpora such as consumer feedback, reviews, requests and complaints. This project is being used by CIVIS, a citizen-wellness NGO in Mumbai.

Getting Started

Coming soon!

Modules

BERT for sentence pair classification

Fine-tuning a pretrained BERT model on a custom dataset for sentence pair classification.
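As a rough illustration of what "sentence pair" input means for BERT, the sketch below packs two sentences into the standard `[CLS] A [SEP] B [SEP]` layout with segment ids and an attention mask. The function name `build_pair_input`, the `max_len` default and the whitespace tokenizer are assumptions made for this example (BERT itself uses WordPiece tokenization); this is not the code from this repository.

```python
def build_pair_input(sentence_a, sentence_b, max_len=16):
    """Pack two sentences into BERT's sentence-pair input layout."""
    # Whitespace split stands in for BERT's WordPiece tokenizer (assumption).
    tokens_a = sentence_a.lower().split()
    tokens_b = sentence_b.lower().split()

    # Reserve three positions for [CLS], [SEP], [SEP]; trim the longer side first.
    while len(tokens_a) + len(tokens_b) > max_len - 3:
        longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        longer.pop()

    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment ids: 0 covers [CLS], sentence A and the first [SEP];
    # 1 covers sentence B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # Attention mask: 1 for real tokens, 0 for padding.
    attention_mask = [1] * len(tokens)

    padding = max_len - len(tokens)
    tokens += ["[PAD]"] * padding
    segment_ids += [0] * padding
    attention_mask += [0] * padding
    return tokens, segment_ids, attention_mask
```

In a real fine-tuning run, the tokens would be mapped to vocabulary ids and fed to the model together with the segment ids and the mask, with the pair label as the classification target.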

Built With

PyTorch
TorchVision
CUDA >=7.0
BERT

Datasets

The following datasets were sampled for ground truth as well as for training and testing the model. Validation data for model evaluation will be released soon.

Openreviews papers and review comments train and test data here
Wikipedia articles with talk/edit comments train and test data here

Sentence Pair Similarity (Algorithm + Implementation)

This has been developed for labelling a pair of sentences with a similarity score based on the cosine similarity of their word vectors, cross-referenced from the BOW (Bag of Words). It is inherently an unsupervised text alignment problem solved using a graph-based approach. The vectors are also validated using the TF-IDF matrix at the document level. Please cite when using this algorithm.

Algorithm here.
Results here
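To make the scoring idea concrete, here is a minimal sketch that scores a sentence pair by the cosine similarity of their bag-of-words count vectors. It is a simplification for illustration only: it uses raw word counts rather than pretrained word vectors, and it leaves out the graph-based alignment and the TF-IDF validation described above. The function names are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(sentence):
    """Lowercase, split on whitespace and count word occurrences."""
    return Counter(sentence.lower().split())


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse count vectors."""
    # Counter returns 0 for missing words, so only shared words contribute.
    dot = sum(count * vec_b[word] for word, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pair_similarity(sentence_a, sentence_b):
    """Similarity score for a sentence pair, in the range [0, 1]."""
    return cosine_similarity(bag_of_words(sentence_a), bag_of_words(sentence_b))
```

Identical sentences score 1, sentences with no words in common score 0, and partial overlaps fall in between.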

Keyword Based Context Mining (Algorithm + Implementation)

This algorithm has been developed for mining all the context vectors in a sentence which lie closest to a given keyword in the 300-dimensional word embedding space. This is used for filtering consumer complaints and aiding a keyword-based search across tens of thousands of complaint instances. Please cite when using this algorithm.

Algorithm here.
Results here
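The idea of mining context words near a keyword can be sketched as follows. This is an illustrative example under stated assumptions, not the repository's implementation: `TOY_VECTORS` is a tiny hand-made embedding table standing in for the pretrained Google News vectors, and `mine_context` and `top_k` are hypothetical names.

```python
import math

# Hand-made toy vectors (hypothetical); a real run would load pretrained embeddings.
TOY_VECTORS = {
    "water": (0.9, 0.1, 0.0),
    "supply": (0.8, 0.2, 0.1),
    "pipeline": (0.85, 0.15, 0.05),
    "garbage": (0.1, 0.9, 0.0),
    "road": (0.0, 0.2, 0.9),
    "broken": (0.3, 0.3, 0.6),
}


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_context(sentence, keyword, vectors, top_k=2):
    """Return the words in the sentence that lie closest to the keyword."""
    key_vec = vectors[keyword]
    scored = []
    for word in sentence.lower().split():
        # Skip the keyword itself and any word without an embedding.
        if word == keyword or word not in vectors:
            continue
        scored.append((cosine(vectors[word], key_vec), word))
    scored.sort(reverse=True)  # highest similarity first
    return [word for _, word in scored[:top_k]]
```

For a keyword search over complaints, each complaint could be ranked by how close its mined context words are to the query keyword.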

Built With

Gensim wordmodel
Gensim word vectors

Datasets

Google News word embeddings: pretrained 300-dimensional vectors for 3 million words and phrases.
Train and test data to be released soon.

Web Crawlers

Two web crawlers have been developed for populating the train and test data from openreviews and wikipedia. Feel free to fork them and build on top of them. Please cite a reference!

OpenReviews Crawler

This crawler uses the JSON source available at openreviews.net/notes and parses it to separate the papers from their review comments.
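A minimal sketch of the separation step is shown below. It works on an inline sample instead of a live request, and the field names (`notes`, `id`, `forum`, `replyto`, `content`) are assumptions modelled on how note objects are commonly structured; they should be checked against the actual response before use. The function name is hypothetical.

```python
import json
from collections import defaultdict

# Inline sample standing in for the crawler's JSON source (assumed structure).
SAMPLE_RESPONSE = """
{"notes": [
  {"id": "p1", "forum": "p1", "replyto": null,
   "content": {"title": "Paper A"}},
  {"id": "r1", "forum": "p1", "replyto": "p1",
   "content": {"review": "Clear writing, weak baselines."}},
  {"id": "r2", "forum": "p1", "replyto": "p1",
   "content": {"comment": "Please release the code."}}
]}
"""


def split_papers_and_reviews(raw_json):
    """Separate top-level papers from the review comments attached to them."""
    notes = json.loads(raw_json)["notes"]
    papers = {}
    reviews = defaultdict(list)
    for note in notes:
        content = note.get("content", {})
        if note.get("replyto") is None:
            # A note that replies to nothing is treated as a paper.
            papers[note["id"]] = content.get("title", "")
        else:
            # Anything else is grouped under the paper (forum) it belongs to.
            text = content.get("review") or content.get("comment", "")
            reviews[note["forum"]].append(text)
    return papers, dict(reviews)
```

The two returned mappings share the paper id as key, which makes it straightforward to build (paper, comment) pairs for training.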

Wikipedia Crawler

This crawler uses the MediaWiki API to fetch the revisions pertaining to a given article. The XML response is then parsed and processed into a format suitable for data sampling.
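The request-and-parse flow can be sketched as follows. The example builds a revisions query URL and parses an inline sample response rather than calling the network. The endpoint, the chosen `rvprop` fields, the sample XML and the function names are assumptions for illustration and may differ from what the crawler in this repository does.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"  # assumed target wiki

# Inline sample shaped like a revisions query response in XML format (assumed).
SAMPLE_XML = """
<api>
  <query>
    <pages>
      <page pageid="1" title="Mumbai">
        <revisions>
          <rev user="Alice" timestamp="2019-01-01T00:00:00Z" comment="fix typo" />
          <rev user="Bob" timestamp="2019-01-02T00:00:00Z" comment="add citation" />
        </revisions>
      </page>
    </pages>
  </query>
</api>
"""


def build_revisions_url(title, limit=5):
    """Build a query URL asking for the latest revisions of one article."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp|user|comment",
        "rvlimit": limit,
        "format": "xml",
    }
    return API_ENDPOINT + "?" + urlencode(params)


def parse_revisions(xml_text):
    """Extract user, timestamp and edit comment from every <rev> element."""
    root = ET.fromstring(xml_text)
    return [
        {
            "user": rev.get("user"),
            "timestamp": rev.get("timestamp"),
            "comment": rev.get("comment"),
        }
        for rev in root.iter("rev")
    ]
```

The resulting list of dictionaries is a convenient intermediate format before sampling article and comment pairs.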

Contribution

Please feel free to raise issues and fix any existing ones. Further details can be found in our code of conduct.

While making a PR, please make sure you:

Always start your PR description with "Fixes #issue_number", if you're fixing an issue.
Briefly mention the purpose of the PR, along with the tools/libraries you have used. It would be great if you could be version specific.
Briefly mention what logic you used to implement the changes/upgrades.
Provide in-code review comments on GitHub to highlight specific LOC if deemed necessary.
Please provide snapshots if deemed necessary.
Update the README if required.

References

The parser for wikipedia talk comments was referenced from https://github.com/bencabrera/grawitas - Thank you for the source code and a detailed insight on how to use it.

Cabrera, B., Steinert, L., Ross, B. (2017). Grawitas: A Grammar-based Wikipedia Talk Page Parser. Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pp. 21-24.

Also, a shoutout to JasonKessler for gisting up a nice kickstart to crawl openreviews.net!


Chintan2108/Text-Classification-and-Context-Mining-for-Document-Summarization


